
    Multigrain shared memory

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (p. 197-203). By Donald Yeung.

    Studying Directory Access Patterns via Reuse Distance Analysis and Evaluating Their Impact on Multi-Level Directory Caches

    The trend for multicore CPUs is towards increasing core count. One of the key limiters to scaling will be the on-chip directory cache. Our work investigates moving portions of the directory away from the cores, perhaps to off-chip DRAM, where ample capacity exists. While such multi-level directory caches exhibit increased latency, several aspects of directory accesses will shield CPU performance from the slower directory, including low access frequency and latency hiding underneath data accesses to main memory. While multi-level directory caches have been studied previously, no work has yet comprehensively quantified the directory access patterns themselves, making it difficult to understand multi-level behavior in depth. This paper presents a framework based on multicore reuse distance for studying directory cache access patterns. Using our analysis framework, we show between 69% and 93% of directory entries are looked up only once or twice during their lifetimes in the directory cache, and between 51% and 71% of dynamic directory accesses are latency tolerant. Using cache simulations, we show a very small L1 directory cache can service 80% of latency-critical directory lookups. Although a significant number of directory lookups and eviction notifications must access the slower L2 directory cache, virtually all of these are latency tolerant.
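
    The lookups-per-lifetime statistic above can be pictured with a toy model. Below is a minimal, hypothetical C++ sketch (not the paper's framework): a small fully-associative LRU directory is simulated, and each time an entry is evicted, the number of lookups it received since insertion is recorded in a histogram, which is the kind of distribution the figures above summarize.

```cpp
#include <cstdint>
#include <cstdio>
#include <list>
#include <map>
#include <unordered_map>
#include <utility>

// Toy fully-associative LRU "directory cache": when an entry is evicted, record
// how many lookups it received during its lifetime in the directory.
struct DirectoryLifetimeProfiler {
    std::size_t capacity;                               // number of directory entries
    std::list<uint64_t> lru;                            // block addresses, MRU at front
    std::unordered_map<uint64_t,
        std::pair<std::list<uint64_t>::iterator, int>> entries;  // block -> (LRU slot, lookups)
    std::map<int, long> lifetimeHistogram;              // lookups-per-lifetime -> #entries

    explicit DirectoryLifetimeProfiler(std::size_t cap) : capacity(cap) {}

    void lookup(uint64_t block) {
        auto it = entries.find(block);
        if (it != entries.end()) {                      // entry present: count the lookup
            ++it->second.second;
            lru.erase(it->second.first);
            lru.push_front(block);
            it->second.first = lru.begin();
            return;
        }
        if (entries.size() == capacity) {               // evict LRU entry, close its lifetime
            uint64_t victim = lru.back();
            lru.pop_back();
            ++lifetimeHistogram[entries[victim].second];
            entries.erase(victim);
        }
        lru.push_front(block);
        entries.emplace(block, std::make_pair(lru.begin(), 1));  // install with its first lookup
    }
};

int main() {
    DirectoryLifetimeProfiler dir(4);
    uint64_t trace[] = {1, 2, 3, 1, 4, 5, 6, 7, 2, 8};  // assumed block-address trace
    for (uint64_t b : trace) dir.lookup(b);
    for (const auto& [lookups, count] : dir.lifetimeHistogram)
        std::printf("%d lookup(s) per lifetime: %ld entries\n", lookups, count);
}
```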

    Exploiting Multi-Loop Parallelism on Heterogeneous Microprocessors

    Heterogeneous microprocessors integrate CPUs and GPUs on the same chip, providing fast CPU-GPU communication and enabling cores to compute on data "in place." These advantages will permit integrated GPUs to exploit a smaller unit of parallelism. But one challenge will be exposing sufficient parallelism to keep all of the on-chip compute resources fully utilized. In this paper, we argue that integrated CPU-GPU chips should exploit parallelism from multiple loops simultaneously. One example of this is nested parallelism, in which one or more inner SIMD loops are nested underneath a parallel outer (non-SIMD) loop. By scheduling the parallel outer loop on multiple CPU cores, multiple dynamic instances of the inner SIMD loops can be scheduled on the GPU cores. This boosts GPU utilization and parallelizes the non-SIMD code. Our preliminary results show exploiting such multi-loop parallelism provides a 3.12x performance gain over exploiting parallelism from individual loops one at a time.
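
    As a rough illustration of nested multi-loop parallelism, the OpenMP sketch below (function name and data layout are invented for illustration, and this is not the paper's CPU-GPU runtime) runs the outer, non-SIMD loop across CPU threads, while each inner loop is written as a SIMD loop that a heterogeneous part could instead launch on the integrated GPU.

```cpp
#include <cstddef>
#include <vector>

// Outer (non-SIMD) loop is distributed over CPU cores; each iteration exposes
// an inner data-parallel loop that an integrated GPU could execute.
void multi_loop(std::vector<std::vector<float>>& rows, float scale) {
    #pragma omp parallel for schedule(dynamic)   // outer loop -> CPU threads
    for (std::size_t i = 0; i < rows.size(); ++i) {
        std::vector<float>& r = rows[i];
        #pragma omp simd                         // inner loop -> SIMD/GPU-style work
        for (std::size_t j = 0; j < r.size(); ++j) {
            r[j] *= scale;
        }
    }
}
```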

    Pipelined CPU-GPU Scheduling for Caches

    Heterogeneous microprocessors integrate a CPU and GPU with a shared cache hierarchy on the same chip, affording low-overhead communication between the CPU and GPU's cores. Oftentimes, large array data structures are communicated from the CPU to the GPU and back. While the on-chip cache hierarchy can support such CPU-GPU producer-consumer sharing, this almost never happens due to poor temporal reuse. Because the data structures can be quite large, by the time the consumer reads the data, it has been evicted from cache even though the producer had brought it on-chip when it originally wrote the data. As a result, the CPU-GPU communication happens through main memory instead of the cache, hurting performance and energy. This paper exploits the on-chip caches in a heterogeneous microprocessor to improve CPU-GPU communication efficiency. We divide streaming computations executed by the CPU and GPU that exhibit producer-consumer sharing into chunks, and overlap the execution of CPU chunks with GPU chunks in a software pipeline. To enforce data dependences, the producer executes one chunk ahead of the consumer at all times. We also propose a low-overhead synchronization mechanism in which the CPU directly controls thread-block scheduling in the GPU to maintain the producer's "run-ahead distance" relative to the consumer. By adjusting the chunk size or run-ahead distance, we can make the CPU-GPU working set fit in the last-level cache, thus permitting the producer-consumer sharing to occur through the LLC. We show through simulation that our technique reduces the number of DRAM accesses by 30.4%, improves performance by 26.8%, and lowers memory system energy by 27.4%, averaged across 7 benchmarks.
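
    The chunked software pipeline can be pictured with ordinary host threads. The C++20 sketch below is a simplified stand-in (chunk sizes and helper functions are assumed, and the paper's GPU thread-block scheduling mechanism is not modeled): two counting semaphores keep the producer at most a fixed run-ahead distance ahead of the consumer, which is what bounds the in-flight working set so it can stay in the LLC.

```cpp
#include <semaphore>
#include <thread>
#include <vector>

constexpr int NUM_CHUNKS  = 64;
constexpr int CHUNK_ELEMS = 1024;
constexpr int RUN_AHEAD   = 2;   // producer may be at most this many chunks ahead

std::vector<float> buffer(NUM_CHUNKS * CHUNK_ELEMS);    // shared producer/consumer data
std::vector<float> chunk_sums(NUM_CHUNKS);
std::counting_semaphore<NUM_CHUNKS> ready(0);           // chunks produced, not yet consumed
std::counting_semaphore<NUM_CHUNKS> space(RUN_AHEAD);   // remaining run-ahead budget

// Placeholder work: the "CPU" fills a chunk, the "GPU" reduces it.
void produce_chunk(int c) {
    for (int i = 0; i < CHUNK_ELEMS; ++i) buffer[c * CHUNK_ELEMS + i] = float(i);
}
void consume_chunk(int c) {
    float sum = 0;
    for (int i = 0; i < CHUNK_ELEMS; ++i) sum += buffer[c * CHUNK_ELEMS + i];
    chunk_sums[c] = sum;
}

int main() {
    std::thread producer([] {
        for (int c = 0; c < NUM_CHUNKS; ++c) {
            space.acquire();     // stall if already RUN_AHEAD chunks ahead
            produce_chunk(c);
            ready.release();     // chunk c is now available to the consumer
        }
    });
    std::thread consumer([] {
        for (int c = 0; c < NUM_CHUNKS; ++c) {
            ready.acquire();     // wait until the producer finishes chunk c
            consume_chunk(c);
            space.release();     // let the producer move one chunk further ahead
        }
    });
    producer.join();
    consumer.join();
}
```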

    Symbiotic Cache Resizing for CMPs with Shared LLC

    This paper investigates the problem of finding the optimal sizes of private caches and a shared LLC in CMPs. Resizing private and shared caches in modern CMPs is one way to squeeze wasteful power consumption out of architectures to improve power efficiency. However, shrinking each private/shared cache has a different impact on the performance loss and the power savings of the CMP because each cache contributes differently to performance and power. It is beneficial for both performance and power to shrink the LRU way of the private/shared cache that saves the most power and increases data traffic the least. This paper presents Symbiotic Cache Resizing (SCR), a runtime technique that reduces the total power consumption of the on-chip cache hierarchy in CMPs with a shared LLC. SCR turns off private/shared-cache ways in an inter-core and inter-level manner so that each disabled way achieves the best power savings while maintaining high performance. SCR finds such optimal cache sizes by utilizing greedy algorithms that we develop in this study. In particular, Prioritized Way Selection picks the most power-inefficient way. LLC-Partitioning-aware Prioritized Way Selection finds optimal cache sizes from the multi-level perspective. Lastly, Weighted Threshold Throttling finds the optimal threshold per cache level. We evaluate SCR in two-core, four-core and eight-core systems. Results show that SCR saves 13% power in the on-chip cache hierarchy and 4.2% power in the system compared to an even LLC partitioning technique. SCR saves 2.7x more power in the cache hierarchy than the state-of-the-art LLC resizing technique while achieving better performance.
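
    To make the greedy flavor of the way-selection step concrete, here is an illustrative C++ sketch (the cost model and names such as WayCandidate and slowdownBudget are assumptions, not SCR's exact formulation): candidate LRU ways from every private cache and the shared LLC are ranked by estimated power saved per unit of slowdown, and ways are turned off greedily while a slowdown budget holds.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct WayCandidate {
    std::string cache;      // e.g. "core0-L1D" or "shared-LLC" (illustrative labels)
    double powerSaved;      // estimated power saved by turning this way off
    double slowdown;        // estimated performance loss if this way is disabled
};

std::vector<WayCandidate> selectWaysToDisable(std::vector<WayCandidate> candidates,
                                              double slowdownBudget) {
    std::vector<WayCandidate> disabled;
    double usedBudget = 0.0;
    // Greedy priority: most power saved per unit of slowdown first.
    std::sort(candidates.begin(), candidates.end(),
              [](const WayCandidate& a, const WayCandidate& b) {
                  return a.powerSaved / (a.slowdown + 1e-9) >
                         b.powerSaved / (b.slowdown + 1e-9);
              });
    for (const WayCandidate& w : candidates) {
        if (usedBudget + w.slowdown <= slowdownBudget) {
            disabled.push_back(w);          // turn this way off
            usedBudget += w.slowdown;
        }
    }
    return disabled;
}
```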

    Studying the impact of multicore processor scaling on directory techniques via reuse distance analysis

    Researchers have proposed numerous directory techniques to address multicore scalability whose behavior depends on the CPU’s particular configuration, e.g. core count and cache size. As CPUs continue to scale, it is essential to explore the directory’s architecture dependences. However, this is challenging using detailed simulation given the large number of CPU configurations that are possible. This paper proposes to use multicore reuse distance analysis to study coherence directories. We develop a framework to extract the directory access stream from parallel LRU stacks, enabling rapid analysis of the directory’s accesses and contents across both core count and cache size scaling. We also implement our framework in a profiler, and apply it to gain insights into multicore scaling’s impact on the directory. Our profiling results show that directory accesses decrease by 3.5x across data cache size scaling, suggesting techniques that trade off access latency for reduced capacity or conflicts become increasingly effective as cache size scales. We also show the portion of on-chip memory devoted to the directory cache can be reduced by 53.3% across data cache size scaling, thus lowering the over-provisioning needed at large cache sizes. Finally, we validate our RD-based directory analyses, and find they are within 13% of cache simulations in terms of access count, on average.
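
    A much-simplified view of how directory accesses fall out of per-core LRU stacks is sketched below (hypothetical C++, assuming fully-associative private caches and ignoring invalidations and sharing-state details that the paper's framework does handle): a reference whose per-core stack distance reaches the private-cache capacity would miss in that core's cache, and it is those misses that reach the coherence directory.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <deque>
#include <vector>

// One LRU stack per core; count references that would miss in the private cache
// (stack distance >= capacity, or never seen) as directory accesses.
struct DirectoryAccessCounter {
    std::vector<std::deque<uint64_t>> stacks;   // per-core LRU stacks of block addresses
    std::size_t privateCacheBlocks;             // modeled private cache capacity, in blocks
    long directoryAccesses = 0;

    DirectoryAccessCounter(int cores, std::size_t blocks)
        : stacks(cores), privateCacheBlocks(blocks) {}

    void reference(int core, uint64_t block) {
        auto& st = stacks[core];
        auto it = std::find(st.begin(), st.end(), block);
        std::size_t depth = (it == st.end()) ? SIZE_MAX : std::size_t(it - st.begin());
        if (depth >= privateCacheBlocks)        // would miss in the private cache...
            ++directoryAccesses;                // ...so the directory is consulted
        if (it != st.end()) st.erase(it);
        st.push_front(block);                   // block becomes most recently used
    }
};

int main() {
    DirectoryAccessCounter dc(2, 2);            // 2 cores, 2-block private caches (toy sizes)
    dc.reference(0, 100); dc.reference(0, 101); dc.reference(0, 100);  // last one hits
    dc.reference(1, 100);                        // core 1 miss -> directory access
    std::printf("directory accesses: %ld\n", dc.directoryAccesses);
}
```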

    Parallelization of the SSCA#3 Benchmark on the RAW Processor

    The MIT Raw machine provides a point-to-point interconnection network for transferring register values between tiles. The programmer must schedule the network communication for each tile by hand and guarantee its correctness, so it is not easy to parallelize benchmarks manually for all possible tile configurations on the Raw processor. To overcome this problem, we develop a communication library and a switch code generator to create the switch code for each tile automatically. We implement our techniques for the SSCA#3 (SAR Sensor Processing, Knowledge Formation) benchmark, and evaluate the parallelism on a physical Raw processor. The experimental results show the SSCA#3 benchmark has dense matrix operations with abundant parallelism. Using 16 tiles, the 'SAR image formation' procedure achieves a speedup of 13.86, and the speedup of the 'object detection' procedure is 9.98.

    Memory Performance Analysis for Parallel Programs Using Concurrent Reuse Distance

    Performance on multicore processors is determined largely by the on-chip cache. Computer architects have conducted numerous studies in the past that vary core count and cache capacity as well as problem size to understand the impact on cache behavior. These studies are very costly due to the combinatorial design spaces they must explore. Reuse distance (RD) analysis can help architects explore multicore cache performance more efficiently. One problem, however, is that multicore RD analysis requires measuring concurrent reuse distance (CRD) profiles across thread-interleaved memory reference streams. Sensitivity to memory interleaving makes CRD profiles architecture dependent, undermining RD analysis benefits. But for parallel programs with symmetric threads, CRD profiles vary with architecture tractably: they change only slightly with cache capacity scaling, and shift predictably to larger CRD values with core count scaling. This enables analysis of a large number of multicore configurations from a small set of measured CRD profiles. This paper investigates using RD analysis to efficiently analyze multicore cache performance for parallel programs, making several contributions. First, we characterize how CRD profiles change with core count and cache capacity. One of our findings is that core count scaling degrades locality, but the degradation only impacts last-level caches (LLCs) below 16MB for our benchmarks and problem sizes, increasing to 128MB if problem size scales by 64x. Second, we apply reference groups to predict CRD profiles across core count scaling, and evaluate prediction accuracy. Finally, we use CRD profiles to analyze multicore cache performance. We find predicted CRD profiles can estimate LLC MPKI to within 76% of simulation for configurations without pathological cache conflicts, in 1/1200th the time needed to simulate the full design space.
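
    As a scaled-down example of the RD machinery referenced above, the C++ sketch below (assuming a single flat LRU stack over the interleaved reference stream; the paper's full CRD treatment of private caches and prediction across core counts is not reproduced) builds a concurrent reuse distance histogram and then estimates misses for a given LLC capacity as the references whose CRD is at least that capacity.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <list>
#include <map>
#include <unordered_map>

struct CrdProfile {
    std::list<uint64_t> stack;                                       // MRU at front
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> pos; // block -> stack slot
    std::map<std::size_t, long> histogram;                           // CRD -> reference count
    long coldMisses = 0;

    void reference(uint64_t block) {                                 // interleaved stream, any core
        auto it = pos.find(block);
        if (it == pos.end()) {
            ++coldMisses;
        } else {
            std::size_t crd = 0;
            for (auto s = stack.begin(); s != it->second; ++s) ++crd; // stack depth = CRD
            ++histogram[crd];
            stack.erase(it->second);
        }
        stack.push_front(block);
        pos[block] = stack.begin();
    }

    long estimatedMisses(std::size_t capacityBlocks) const {          // misses for an LLC this large
        long misses = coldMisses;
        for (const auto& [crd, count] : histogram)
            if (crd >= capacityBlocks) misses += count;
        return misses;
    }
};

int main() {
    CrdProfile prof;
    uint64_t trace[] = {1, 2, 3, 1, 2, 4, 1, 3};   // assumed interleaved block trace
    for (uint64_t b : trace) prof.reference(b);
    std::printf("misses with a 2-block LLC: %ld\n", prof.estimatedMisses(2));
}
```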

    Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis

    Understanding multicore memory behavior is crucial, but can be challenging due to the complex cache hierarchies employed in modern CPUs. In today’s hierarchies, performance is determined by complicated thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform extensive simulations to study these interactions, but this can be costly and not very insightful. An alternative is multicore reuse distance (RD) analysis, which can provide extremely rich information about multicore memory behavior. In this paper, we apply multicore RD analysis to better understand cache system design. We focus on loop-based parallel programs, an important class of programs for which RD analysis provides high accuracy. We propose a novel framework to identify optimal multicore cache hierarchies, and extract several new insights. We also characterize how the optimal cache hierarchies vary with core count and problem size.